The analysis examines trends in Business Analytics, Data Science, and Machine Learning job postings, with a focus on the skills required for these roles. The study evaluates how varying skill combinations influence salary levels, remote work availability, and career progression pathways.
This analysis applies three machine learning approaches to job posting data: clustering to group roles by skill requirements, regression to examine how skills and experience influence salary, and classification to distinguish ML/Data Science positions from Business Analytics and other jobs. Using 25 technical skills along with experience and remote work indicators, the analysis shows that Business Analytics dominates the market (35% of roles), while ML and DS remain smaller but specialized segments. Results highlight that experience is the strongest salary driver, jobs fall into six clusters with different pay and remote work patterns, and BA, ML, and DS roles each display distinct skill signatures that make them easy to differentiate.
2 Data Loading and Setup
The analysis starts by loading the Lightcast job postings dataset and identifying relevant skill columns. The dataset contains comprehensive information about job postings including titles, salaries, required skills, and other job characteristics.
Code
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import json
import re
from collections import Counter

pio.templates.default = "plotly_white"
pio.renderers.default = "notebook"

# Load data from csv
df = pd.read_csv("data/lightcast_job_postings.csv", low_memory=False)
print(f"Dataset loaded: {len(df):,} rows, {len(df.columns)} columns")
# print(df.head())
Dataset loaded: 72,498 rows, 131 columns
2.1 Important Skills columns
The dataset contains multiple skill-related columns. After examining the schema, the columns ‘SKILLS_NAME’, ‘SOFTWARE_SKILLS_NAME’ and ‘SPECIALIZED_SKILLS_NAME’ provide the most detailed skill information for this analysis. These columns list the specific technical skills mentioned in each job posting.
3 Skills Data Preprocessing
The next step involves filtering the data to include only records with valid salary and title information. Then, binary features are created for 25 key technical skills covering ML, Data Science, and Business Analytics domains to enable machine learning analysis.
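The filtering step can be illustrated on a small example. This is a minimal sketch on a hypothetical toy frame (the rows are invented; only the `SALARY` and `TITLE` column names come from the dataset), showing how missing titles, unparseable salaries, and zero salaries are all excluded.

```python
import pandas as pd

# Hypothetical toy postings; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "TITLE": ["Data Analyst", None, "ML Engineer", "BI Developer"],
    "SALARY": ["85000", "90000", "not listed", "0"],
})

# Drop rows missing either field, then coerce salary text to numbers.
valid = toy.dropna(subset=["SALARY", "TITLE"]).copy()
valid["SALARY"] = pd.to_numeric(valid["SALARY"], errors="coerce")

# Unparseable salaries became NaN; NaN > 0 is False, so they drop out here too.
valid = valid[valid["SALARY"] > 0]
print(valid)
```

Only the first row survives: the second has no title, the third has a salary that cannot be parsed, and the fourth has a salary of zero.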
Code
# Apply filters (copy so later column assignment does not act on a view)
df_filtered = df.dropna(subset=['SALARY', 'TITLE']).copy()

# Convert salary to numeric and filter
df_filtered['SALARY'] = pd.to_numeric(df_filtered['SALARY'], errors='coerce')
df_filtered = df_filtered[df_filtered['SALARY'] > 0]
print(f"Records after filtering: {len(df_filtered):,}")

df_skills = df_filtered.copy()

# Focus on key Business Analytics/ML/Data Science skills. Key skills for
# BA/ML/DS roles identified manually.
key_skills = [
    'Python (Programming Language)', 'R (Programming Language)', 'SQL (Programming Language)',
    'Machine Learning', 'Data Science', 'Data Analysis', 'Statistics', 'Artificial Intelligence',
    'TensorFlow', 'PyTorch (Machine Learning Library)', 'Pandas (Python Package)',
    'NumPy (Python Package)', 'Scikit-Learn (Python Package)', 'Big Data', 'Apache Spark',
    'Apache Hadoop', 'Amazon Web Services', 'Microsoft Azure', 'Google Cloud Platform (Gcp)',
    'Data Visualization', 'Tableau (Business Intelligence Software)', 'Power BI',
    'Natural Language Processing (NLP)', 'Computer Vision', 'Deep Learning'
]
print(f"Using focused {len(key_skills)} BA/ML/DS technical skills for analysis")

# Create binary features for each key skill.
for skill in key_skills:
    # Clean skill name for column naming
    # Eg: R (Programming Language) --> has_r_programming_language
    skill_col_name = f'has_{skill.lower().replace(" ", "_").replace("-", "_").replace("(", "").replace(")", "")}'
    df_skills[skill_col_name] = (
        df_skills['SKILLS_NAME'].str.contains(skill, case=False, na=False, regex=False) |
        df_skills['SOFTWARE_SKILLS_NAME'].str.contains(skill, case=False, na=False, regex=False) |
        df_skills['SPECIALIZED_SKILLS_NAME'].str.contains(skill, case=False, na=False, regex=False)
    ).astype(int)
print("Binary skill features created")

# Create ML/DS role indicator using focused skills
core_ml_skills = ['has_machine_learning', 'has_artificial_intelligence', 'has_tensorflow',
                  'has_pytorch_machine_learning_library', 'has_deep_learning',
                  'has_natural_language_processing_nlp', 'has_computer_vision']
core_ds_skills = ['has_python_programming_language', 'has_r_programming_language', 'has_statistics',
                  'has_data_science', 'has_pandas_python_package', 'has_numpy_python_package',
                  'has_scikit_learn_python_package', 'has_big_data']
core_ba_skills = ['has_data_analysis', 'has_data_visualization', 'has_sql_programming_language',
                  'has_tableau_business_intelligence_software', 'has_power_bi']

# Role indicators
# ML roles are straightforward: any core ML skill qualifies.
df_skills['is_ml_role'] = (
    (df_skills[core_ml_skills].sum(axis=1) > 0)
).astype(int)

# R language is primarily associated with Data Science field. So,
# if job requires R language or if it has more than one data science
# skills then it is considered DS role.
# Each comparison is parenthesized because `|` binds more tightly than `==`.
df_skills['is_ds_role'] = (
    (df_skills['has_r_programming_language'] == 1) |
    (df_skills[core_ds_skills].sum(axis=1) > 1)
).astype(int)

# Business Analytics roles typically require SQL, visualization tools (Tableau, Power BI)
# and data analysis capabilities. If job has two or more BA skills, consider it a BA role.
df_skills['is_ba_role'] = (
    df_skills[core_ba_skills].sum(axis=1) >= 2
).astype(int)

# Remote work indicator
df_skills['is_remote'] = df_skills['REMOTE_TYPE'].fillna(0).astype(int)
df_skills['experience_years'] = df_skills['MIN_YEARS_EXPERIENCE'].fillna(0)

df_final = df_skills
print(f"Final dataset size: {len(df_final):,}")
print(f"ML roles identified: {df_final['is_ml_role'].sum():,}")
print(f"Data Science roles identified: {df_final['is_ds_role'].sum():,}")
print(f"Business Analytics roles identified: {df_final['is_ba_role'].sum():,}")
Records after filtering: 30,808
Using focused 25 BA/ML/DS technical skills for analysis
Binary skill features created
Final dataset size: 30,808
ML roles identified: 3,226
Data Science roles identified: 2,877
Business Analytics roles identified: 10,831
For each of the 25 key skills, a binary indicator variable is created (1 if the skill is mentioned, 0 otherwise). This transforms the text skill data into numerical features suitable for machine learning models.
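To make the indicator construction concrete, here is a small sketch on invented data. The toy rows and the helper name `skill_column` are assumptions for illustration; the column-naming rule and the case-insensitive, literal substring match follow the preprocessing code above.

```python
import pandas as pd

# Hypothetical toy postings with two of the three skill columns.
toy = pd.DataFrame({
    "SKILLS_NAME": [
        "SQL (Programming Language), Tableau (Business Intelligence Software)",
        "Machine Learning, Python (Programming Language)",
        None,
    ],
    "SOFTWARE_SKILLS_NAME": [None, "TensorFlow", "Power BI"],
})
skills = ["SQL (Programming Language)", "TensorFlow", "Power BI"]


def skill_column(skill: str) -> str:
    """Illustrative helper: 'Power BI' -> 'has_power_bi'."""
    cleaned = skill.lower()
    for old, new in ((" ", "_"), ("-", "_"), ("(", ""), (")", "")):
        cleaned = cleaned.replace(old, new)
    return f"has_{cleaned}"


for skill in skills:
    hit = pd.Series(False, index=toy.index)
    for col in ["SKILLS_NAME", "SOFTWARE_SKILLS_NAME"]:
        # Literal, case-insensitive substring match; missing text counts as no match.
        hit |= toy[col].str.contains(skill, case=False, na=False, regex=False)
    toy[skill_column(skill)] = hit.astype(int)

print(toy.filter(like="has_"))
```

Because this is a substring match, a short skill name would also match inside longer skill names; the fully qualified names used here (for example "R (Programming Language)") make that unlikely.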
3.1 Role Classification Logic
Three role categories are identified based on technical skills:
ML roles: Mention at least one core ML/AI skill (Machine Learning, Artificial Intelligence, TensorFlow, PyTorch, Deep Learning, NLP, Computer Vision)
Data Science roles: Mention R, or at least two skills from the data science set (Python, R, Statistics, Data Science, Pandas, NumPy, Scikit-learn, Big Data)
Business Analytics roles: Mention at least two BA skills (SQL, Data Analysis, Data Visualization, Tableau, Power BI)
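The three rules can be sketched on a toy frame. The rows are invented and the skill lists are shortened for readability; the thresholds (any ML skill, R or more than one DS skill, two or more BA skills) follow the rules described above.

```python
import pandas as pd

# Hypothetical indicator rows; each row is one posting.
toy = pd.DataFrame({
    "has_machine_learning": [1, 0, 0, 0],
    "has_r_programming_language": [0, 1, 0, 0],
    "has_python_programming_language": [0, 0, 1, 0],
    "has_statistics": [0, 0, 1, 0],
    "has_sql_programming_language": [0, 0, 1, 1],
    "has_tableau_business_intelligence_software": [0, 0, 1, 0],
})
ml = ["has_machine_learning"]
ds = ["has_r_programming_language", "has_python_programming_language", "has_statistics"]
ba = ["has_sql_programming_language", "has_tableau_business_intelligence_software"]

toy["is_ml_role"] = (toy[ml].sum(axis=1) > 0).astype(int)
# Parentheses matter: `|` binds more tightly than `==` in Python.
toy["is_ds_role"] = (
    (toy["has_r_programming_language"] == 1) | (toy[ds].sum(axis=1) > 1)
).astype(int)
toy["is_ba_role"] = (toy[ba].sum(axis=1) >= 2).astype(int)

print(toy[["is_ml_role", "is_ds_role", "is_ba_role"]])
```

The third posting is flagged as both a DS and a BA role, and the fourth (SQL only) is flagged as none, so the categories are neither mutually exclusive nor exhaustive.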
The analysis examines how these specialized skills impact salary and career opportunities. Machine learning models are used to find patterns that can guide job seekers in choosing which skills to develop.
4 Feature Engineering for ML
Before building models, the dataset is prepared by selecting relevant columns. This includes the salary (target variable), skill indicators, remote work status, and experience years.
Code
# Just prepare the modeling dataset
modeling_cols = ['SALARY', 'is_ml_role', 'is_ds_role', 'is_ba_role', 'is_remote', 'experience_years'] + \
    [col for col in df_final.columns if col.startswith('has_')]
df_modeling = df_final[modeling_cols].copy()

print("Features for modeling:")
print(f"Dataset shape: {df_modeling.shape}")
print(f"Columns: {list(df_modeling.columns)}")
print(f"Missing values: {df_modeling.isnull().sum().sum()}")
The modeling dataset now contains binary skill features, experience, remote work indicator, and salary information. This structured format allows application of various machine learning techniques.
5 Unsupervised Learning
5.1 KMeans Clustering Based on Skills
The first machine learning approach uses KMeans clustering to discover natural groupings in the job market. This unsupervised technique groups jobs with similar skill profiles together, without using salary information. The goal is to see if jobs naturally segment into distinct categories based on their requirements.
Code
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, f1_score, confusion_matrix, classification_report

# Prepare features for clustering using skills and other features
skill_feature_cols = [col for col in df_modeling.columns if col.startswith('has_')]
print(f"Available skill features: {len(skill_feature_cols)}")

# Base clustering features
clustering_features = skill_feature_cols + ['experience_years', 'is_remote']

# Encode ONET and NAICS6.
le_onet = LabelEncoder()
df_modeling['onet_encoded'] = le_onet.fit_transform(df_final['ONET'].fillna('Unknown'))
clustering_features.append('onet_encoded')

le_naics = LabelEncoder()
df_modeling['naics_encoded'] = le_naics.fit_transform(df_final['NAICS6'].fillna('Unknown'))
clustering_features.append('naics_encoded')

# Prepare clustering data
X_cluster = df_modeling[clustering_features].fillna(0)

# Scale features
scaler_cluster = StandardScaler()
X_cluster_scaled = scaler_cluster.fit_transform(X_cluster)

# KMeans clustering
kmeans = KMeans(n_clusters=6, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_cluster_scaled)
df_modeling['cluster'] = clusters

# print("Skills based clustering completed")
# print("Cluster centers:")
# for i, center in enumerate(kmeans.cluster_centers_):
#     print(f"Cluster {i}: {center}")
Available skill features: 25
The clustering model groups similar jobs together using skill patterns, experience requirements, and job characteristics. The algorithm assigns each job to one of 6 clusters. Now the characteristics of each cluster can be examined to understand what makes them distinct.
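As a rough illustration of the scale-then-cluster step, the sketch below runs KMeans on a tiny, invented feature matrix with two obvious skill profiles. The data, the column meanings, and the choice of two clusters are assumptions for the example; the report itself uses six clusters on the full feature set.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix; columns stand for
# has_ml, has_deep_learning, has_sql, has_tableau, experience_years.
X = np.array([
    [1, 1, 0, 0, 2],
    [1, 1, 0, 0, 3],
    [1, 1, 0, 0, 4],
    [0, 0, 1, 1, 5],
    [0, 0, 1, 1, 6],
    [0, 0, 1, 1, 7],
], dtype=float)

# Standardize so experience (in years) does not outweigh the 0/1 skill flags.
X_scaled = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_scaled)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is that the three ML-style rows share one label and the three BA-style rows share the other.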
Code
# Analyze clustering.
cluster_summary = df_modeling.groupby('cluster').agg({
    'SALARY': ['count', 'mean'],
    'is_ml_role': 'mean',
    'is_ds_role': 'mean',
    'is_ba_role': 'mean',
    'is_remote': 'mean',
    'experience_years': 'mean'
}).round(2)
cluster_summary.columns = ['count', 'avg_salary', 'ml_role_pct', 'ds_role_pct', 'ba_role_pct',
                           'remote_percentage', 'avg_experience']
cluster_summary = cluster_summary.reset_index()

# Compute combined BA/ML/DS percentage on-the-fly
# A job has BA/ML/DS if it has any of the three role types
cluster_summary['ml_ds_ba_combined_pct'] = cluster_summary.apply(
    lambda row: (
        (df_modeling[df_modeling['cluster'] == row['cluster']][['is_ml_role', 'is_ds_role', 'is_ba_role']].sum(axis=1) > 0).mean()
    ),
    axis=1
).round(2)

print("Skills based Cluster Summary:")
print(cluster_summary)

# Visualize cluster characteristics.
fig = make_subplots(
    rows=2, cols=3,
    subplot_titles=('Cluster Size', 'Average Salary', 'BA/ML/DS Role %',
                    'Remote Work %', 'Avg Experience', 'Salary Distribution'),
    specs=[[{"type": "bar"}, {"type": "bar"}, {"type": "bar"}],
           [{"type": "bar"}, {"type": "bar"}, {"type": "scatter"}]]
)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['count'], name="Count"), row=1, col=1)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['avg_salary'], name="Avg Salary"), row=1, col=2)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['ml_role_pct'], name="ML %"), row=1, col=3)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['ds_role_pct'], name="DS %"), row=1, col=3)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['ba_role_pct'], name="BA %"), row=1, col=3)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['remote_percentage'], name="Remote %"), row=2, col=1)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['avg_experience'], name="Experience"), row=2, col=2)

# Salary distribution by cluster.
fig.add_trace(
    go.Scatter(
        x=df_modeling['cluster'],
        y=df_modeling['SALARY'],
        mode='markers',
        opacity=0.6,
        name="Jobs"
    ),
    row=2, col=3
)
fig.update_layout(
    height=650,
    showlegend=False,
    template="plotly_white",
    title={
        'text': "Skills-Based KMeans Clustering Results",
        'y': 0.98,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    margin=dict(t=80)
)
fig.show()
The clustering analysis grouped jobs based on their skill requirements and characteristics. The analysis identified 6 distinct job clusters, each with different salary levels, remote work availability, and skill profiles.
Key Findings:
Business Analytics dominates: 10,831 BA roles vs. 3,226 ML and 2,877 DS
BA-focused growth: Cluster 3 ($109K) — strong BA demand with DS hybrid edge
Specialist track: Cluster 4 ($140K) — pure ML, fewer jobs but high pay
Hybrid advantage: Cluster 0 ($140K) and Cluster 5 ($118K, 56% remote) — multi-skill roles with flexibility
6 Supervised Learning
6.1 Multiple Regression
The second approach uses supervised learning to predict salary based on skills and experience. Two regression models are trained: Multiple Linear Regression and Random Forest. This analysis identifies which skills and factors most strongly influence compensation.
Code
# Identify regression features.
# Focus on skills (not role labels) to understand how skills directly affect salary
regression_features = skill_feature_cols + ['experience_years', 'is_remote']

# Prepare regression data using salary as the target variable
X_reg = df_modeling[regression_features].fillna(0)
y_reg = df_modeling['SALARY']

X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)
print(f"Training set size: {len(X_train):,}")
print(f"Test set size: {len(X_test):,}")

# Scale features
scaler_reg = StandardScaler()
X_train_scaled = scaler_reg.fit_transform(X_train)
X_test_scaled = scaler_reg.transform(X_test)

# Multiple Linear Regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)

# Random Forest Regression
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train_scaled, y_train)

print("Skills based regression models training completed")
Training set size: 24,646
Test set size: 6,162
Skills based regression models training completed
Both models are trained on 80% of the data and will be evaluated on the remaining 20% test set. The Random Forest model can capture non-linear relationships and interactions between skills, while Multiple Linear Regression provides a baseline for comparison.
Code
# Evaluate regression models
# Multiple Linear Regression predictions
y_pred_lr = lr.predict(X_test_scaled)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
r2_lr = r2_score(y_test, y_pred_lr)

# Random Forest predictions
y_pred_rf = rf_reg.predict(X_test_scaled)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

print("Skills-based Regression Model Performance:")
print(f"Multiple Linear Regression - RMSE: ${rmse_lr:,.2f}, R²: {r2_lr:.4f}")
print(f"Random Forest - RMSE: ${rmse_rf:,.2f}, R²: {r2_rf:.4f}")

# Feature importance for Random Forest
# Only use features that actually exist in the model
actual_feature_names = [col for col in regression_features if col in X_train.columns]
importances = rf_reg.feature_importances_

# Visualize feature importance
fig = px.bar(x=actual_feature_names, y=importances,
             title="Skills Impact on Salary (Random Forest Feature Importance)",
             labels={'x': 'Features', 'y': 'Importance'})
fig.update_layout(template="plotly_white", xaxis_tickangle=-45)
fig.show()

# Top skills by salary impact
skill_importance = list(zip(actual_feature_names, importances))
skill_importance.sort(key=lambda x: x[1], reverse=True)
print("\nTop skills by salary impact:")
for skill, importance in skill_importance[:10]:
    print(f"{skill}: {importance:.4f}")
Skills-based Regression Model Performance:
Multiple Linear Regression - RMSE: $37,899.01, R²: 0.2780
Random Forest - RMSE: $32,558.54, R²: 0.4672
Top skills by salary impact:
experience_years: 0.4932
is_remote: 0.0728
has_data_analysis: 0.0426
has_tableau_business_intelligence_software: 0.0372
has_amazon_web_services: 0.0361
has_sql_programming_language: 0.0350
has_statistics: 0.0302
has_python_programming_language: 0.0300
has_machine_learning: 0.0265
has_data_science: 0.0261
6.1.1 Regression Analysis: What drives salary?
Prediction models were built to understand how skills influence salary. On the held-out test set, the Random Forest model achieved an R² of 0.47 compared to 0.28 for Multiple Linear Regression, which suggests that the skill-salary relationship involves non-linear effects or interactions that a linear model cannot capture.
Model Performance:
Random Forest: R² = 0.47 (explains 47% of salary variation), RMSE = $32,559
Multiple Linear Regression: R² = 0.28
Insight: Skills alone do not fully explain salary — other factors also matter.
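For readers less familiar with the two metrics, the sketch below computes RMSE and R² by hand on four invented salary predictions. The numbers are hypothetical and unrelated to the models above; the formulas are the standard ones that `mean_squared_error` and `r2_score` implement.

```python
import numpy as np

# Hypothetical true and predicted salaries for four postings.
y_true = np.array([100_000.0, 120_000.0, 80_000.0, 140_000.0])
y_pred = np.array([110_000.0, 115_000.0, 90_000.0, 125_000.0])

residuals = y_true - y_pred

# RMSE: typical size of a prediction error, in dollars.
rmse = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus (unexplained variation / total variation around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: ${rmse:,.2f}, R²: {r2:.3f}")
```

Read this way, an R² of 0.47 means the model's squared errors are about 53% as large as those of always predicting the mean salary.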
Key Salary Drivers (Feature Importance):
Experience (0.49): Largest factor, accounting for nearly half of the model's total feature importance
Remote work (0.07): Remote status is associated with pay differences
Data Analysis (0.04): Core analytical capability
Tableau (0.04): Visualization and BI tool
AWS (0.04): Cloud computing platform
SQL (0.04): Database querying and manipulation
Statistics (0.03): Analytical foundation
Python (0.03): Programming language
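When reading the importance values above, it may help to see how they behave on synthetic data. The sketch below is an illustration under invented assumptions (the salary formula, sample size, and feature names are made up): scikit-learn's impurity-based importances are non-negative and sum to 1, so each value is a share of the model's total importance and carries no sign.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Hypothetical features: experience drives salary strongly, one skill weakly,
# and remote status not at all.
experience = rng.integers(0, 10, size=n)
has_skill = rng.integers(0, 2, size=n)
is_remote = rng.integers(0, 2, size=n)
salary = 50_000 + 8_000 * experience + 5_000 * has_skill + rng.normal(0, 5_000, size=n)

X = np.column_stack([experience, has_skill, is_remote])
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, salary)

importances = model.feature_importances_
for name, value in zip(["experience_years", "has_skill", "is_remote"], importances):
    print(f"{name}: {value:.3f}")
print(f"Sum: {importances.sum():.3f}")
```

An importance score therefore says how much the model relies on a feature, not whether the feature raises or lowers salary; the direction would need a separate check such as comparing group means.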
Career Implications:
Experience is critical — the strongest driver of salary.
Remote status matters: it is the second most important feature, although importance scores alone do not show whether remote roles pay more or less.
Skill combinations matter — technical, analytical, and cloud skills together shape salary outcomes.
Summary: Salary is not determined by skills alone. Experience and work flexibility are key, while technical skills provide additional differentiation.
6.2 Classification to Identify BA/ML/DS Roles
Although the project required only one supervised learning model, this analysis also explores classification to distinguish ML/Data Science roles from Business Analytics and other positions. A Random Forest Classifier is trained to predict whether a job is an ML/DS role based on its skill requirements. This analysis reveals which skills are the strongest “signature” indicators that distinguish ML/DS positions from BA roles.
Code
# Prepare features for classification.
classification_features = skill_feature_cols + ['experience_years', 'is_remote']

# Prepare classification data
X_clf = df_modeling[classification_features].fillna(0)

# Target: ML/DS roles (computed from is_ml_role OR is_ds_role)
y_clf = ((df_modeling['is_ml_role'] == 1) | (df_modeling['is_ds_role'] == 1)).astype(int)

# Train/test split for classification
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(X_clf, y_clf, test_size=0.2, random_state=42)

# Scale features
scaler_clf = StandardScaler()
X_train_clf_scaled = scaler_clf.fit_transform(X_train_clf)
X_test_clf_scaled = scaler_clf.transform(X_test_clf)

# Random Forest Classification
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train_clf_scaled, y_train_clf)

print("Skills-based classification model trained successfully!")
Skills-based classification model trained successfully!
The classifier learns patterns that distinguish ML/DS roles from BA and other positions based on their skill profiles. The model is now evaluated to see how accurately it can identify these specialized ML/DS roles versus the more common BA positions.
Code
# Random Forest predictions
y_pred_rf_clf = rf_clf.predict(X_test_clf_scaled)
accuracy_rf = accuracy_score(y_test_clf, y_pred_rf_clf)
f1_rf = f1_score(y_test_clf, y_pred_rf_clf)

print("Skills based Classification Model Performance:")
print(f"Random Forest - Accuracy: {accuracy_rf:.4f}, F1 Score: {f1_rf:.4f}")

# Confusion Matrix for Random Forest
cm = confusion_matrix(y_test_clf, y_pred_rf_clf)

# Visualize confusion matrix
fig = px.imshow(cm, text_auto=True, aspect="auto",
                title="Confusion Matrix - ML/DS Role Classification",
                labels=dict(x="Predicted", y="Actual"),
                color_continuous_scale="Blues")
fig.update_layout(template="plotly_white")
fig.update_xaxes(tickvals=[0, 1], ticktext=['Not ML/DS', 'ML/DS'])
fig.update_yaxes(tickvals=[0, 1], ticktext=['Not ML/DS', 'ML/DS'])
fig.show()

print("Classification Report:")
print(classification_report(y_test_clf, y_pred_rf_clf))

# Only use features that actually exist in the classification model
clf_actual_feature_names = [col for col in classification_features if col in X_train_clf.columns]
clf_importances = rf_clf.feature_importances_

# Visualize classification feature importance
fig = px.bar(x=clf_actual_feature_names, y=clf_importances,
             title="Skills Impact on ML/Data Science Role Classification",
             labels={'x': 'Features', 'y': 'Importance'})
fig.update_layout(template="plotly_white", xaxis_tickangle=-45)
fig.show()
Skills based Classification Model Performance:
Random Forest - Accuracy: 0.9995, F1 Score: 0.9986
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5082
           1       1.00      1.00      1.00      1080

    accuracy                           1.00      6162
   macro avg       1.00      1.00      1.00      6162
weighted avg       1.00      1.00      1.00      6162
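A Random Forest classifier was used to predict whether a job is an ML/Data Science role based on its skill requirements. The model achieved very strong performance in separating ML/DS roles from Business Analytics and other positions. This is expected: the ML/DS label is itself computed from the same skill indicators the model receives as features, so the classifier is largely recovering the labeling rule.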
Model Performance:
Accuracy: 99.95% (F1 score: 0.9986), so nearly all test postings are classified correctly
Insight: ML/DS roles have distinct skill patterns compared to BA and general analyst jobs
Conclusion: Skill-based criteria effectively distinguish ML/DS roles from BA positions
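One caveat when interpreting the accuracy is that the target here is a deterministic function of the input features. The sketch below, on purely synthetic data with a made-up labeling rule of the same shape (one flag, or more than one of two others), shows that a Random Forest reaches near-perfect accuracy in that situation regardless of any real-world signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Three hypothetical binary skill flags, assigned at random.
X = rng.integers(0, 2, size=(2000, 3))

# The label is computed directly from the features, as with is_ml_role / is_ds_role.
y = ((X[:, 0] == 1) | (X[:, 1] + X[:, 2] > 1)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test))
print(f"Accuracy on a rule-derived label: {acc:.4f}")
```

A stricter test of whether ML/DS roles are distinct would use a label the features do not define, for example one derived from job titles or occupation codes; that is a suggestion, not something the analysis above does.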
Key Predictive Skills (Feature Importance)
Programming: Python, R
ML Frameworks: TensorFlow, PyTorch
Statistical Modeling: Core differentiator for ML/DS
BA-Oriented Skills: SQL, Tableau, Power BI, Data Analysis (more common in BA roles)
Career Implications
Distinct skill sets: ML/DS roles require clearly different capabilities than BA roles
ML/DS focus: Programming, modeling, and ML frameworks are the strongest signals
BA focus: SQL, visualization, and reporting tools dominate BA roles
Career development: Building expertise in high-importance ML/DS features directly improves readiness for ML/DS positions
Summary: The Random Forest classifier confirms that ML/DS roles are defined by specialized technical skills, while BA roles emphasize analysis and visualization tools. This distinction provides a clear roadmap for professionals aiming to transition into ML/DS careers.
7 Model Results Visualization
This section provides a consolidated view of all three modeling approaches. The comparison shows how different models perform on their respective tasks and highlights the most impactful skills across different analyses.
Code
# Summarize core model performance
model_summary = pd.DataFrame({
    'Model': ['Multiple Linear Regression', 'Random Forest (Regression)', 'Random Forest (Classification)'],
    'R² / Accuracy': [r2_lr, r2_rf, accuracy_rf],
    'RMSE / F1 Score': [rmse_lr, rmse_rf, f1_rf]
})
print(model_summary)

# Visualization of model results
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Model Performance Comparison', 'Skills vs Salary Impact'),
    specs=[[{"type": "bar"}, {"type": "bar"}]]
)

# Model performance comparison
models = ['Multiple Linear Regression', 'Random Forest Regression', 'Random Forest Classification']
metrics = [r2_lr, r2_rf, accuracy_rf]
fig.add_trace(go.Bar(x=models, y=metrics, name="Performance"), row=1, col=1)

# Skills vs salary impact
top_skills_salary = skill_importance[:8]
fig.add_trace(go.Bar(x=[s[0] for s in top_skills_salary],
                     y=[s[1] for s in top_skills_salary],
                     name="Salary Impact"), row=1, col=2)

fig.update_layout(
    height=450,
    showlegend=False,
    template="plotly_white",
    title={
        'text': "Core Model Results - BA/ML/DS Skills Analysis",
        'y': 0.98,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
    },
    margin=dict(t=80)
)
fig.show()
Model R² / Accuracy RMSE / F1 Score
0 Multiple Linear Regression 0.278032 37899.005358
1 Random Forest (Regression) 0.467166 32558.537199
2 Random Forest (Classification) 0.999513 0.998609
8 Key Takeaways and Recommendations
8.1 Summary of Findings
Our analysis of business analytics, data science and machine learning job postings reveals several important patterns:
Role Distribution: Business Analytics dominates (35% of jobs), while ML and DS remain smaller but specialized segments.
Job Segmentation: Six distinct clusters reveal clear differences in pay, experience, and hybrid skill mixes.
Salary Drivers: Experience is the strongest factor (0.49 Random Forest feature importance), with remote work and technical skills adding incremental impact.
Role Differentiation: ML/DS roles are highly distinct, with classification accuracy of 99.95% separating them from BA roles.
8.2 Recommendations for Job Seekers
For Career Advancement:
Gain experience - it’s the single biggest salary driver (49% importance)
Remote work flexibility - BA/ML/DS roles pay well even when remote, showing that onsite presence is not necessary for competitive salaries.